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(54) Method and apparatus for interfacing a processor to a coprocessor 



(57) A processor (12) to coprocessor (14) interface 
supporting multiple coprocessors (14, 16) utilizes com- 
piler generatable software type function call and return, 
instruction execute, and variable load and store inter- 
face instructions. Data is moved between the processor 
(12) and coprocessor (14) on a bi-directional shared 
bus (28) either implicitly through register snooping and 
broadcast, or explicitly through function call and return 
and variable load and store interface instructions. The 



3f 



COPROCESSOR 
24 



INTERFACE 
30 



] COPROCESSOR 
J* 



3£ 



load and store interface instructions allow selective 
memory address preincrementation. The bi-directional 
bus (28) is potentially driven both ways on each clock 
cycle. The interface separates interface instruction 
decode and execution. Pipelined operation is provided 
by indicating decoded instruction discard by negating a 
decode signal before an execute signal is asserted. 
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Descripti n 

Field of the Invention 

[0001] The present invention relates in general to a 
data processing system having a processor and at least 
one coprocessor, and; more particularly, to a method 
and apparatus for. interfacing the processor to the 
coprocessor. 

Background of the Invention 

[0002] The ability to extend a baseline architecture 
processor functionality through dedicated and special- 
ized hardware functional elements is an Important 
aspect of scaleabie and extensible architectures. 
[0003] One of the preferred methods for extending a 
baseline architecture processor functionality is through 
the use of coprocessors. These are dedicated usually 
single purpose processors that operate at the direction 
of a processor. One of the traditional uses of coproces- 
sors was as math coprocessors to selectively provide 
floating point capabilities to architectures that did not 
directly support such. Some example of such math 
coprocessors are the Intel 8087 and 80287. Some other 
potential uses or types of coprocessors include: multi- 
ply-accumulators, modulator/demodulators (modems), 
digital signal processors (DSP), vitturbi calculators, 
cryptographic processors, image processors, and vec- 
tor processors. ~ 
[0004] There have been two different approaches to 
coprocessors. On the one hand, the floating point unit 
for the Digital Equipment Corporation (DEC) PDP-11 
family of computers was very tightly coupled to its pri- 
mary processor. One problem that arose is that this 
tightly coupling required the primary processor to know 
a substantial amount about the operation of the coproc- 
essor. This complicates'circuit design to such an extent 
that addition of a new coprocessor into an integrated* 
system is a major engineering problem. ... 
[0005] The alternative implementation has been to 
loosely couple the coprocessor to the primary proces- 
sor. This did have the advantage of abstracting and iso- 
lating the operation of the coprocessor from the primary 
processor, and thus substantially lessening the effort 
required to integrate a new coprocessor with an existing 
processor. However, this invariably came at a. price. 
Loss of performance is one problem of this approach. 
On problem with taking the type of performance hit 
resulting from this loose coupling is that the break-even 
point for invoking such a coprocessor is increased cor- 
respondingly. Thus, many otherwise attractive applica- 
tions for coprocessors are not cost effective. 
Additionally, such an approach often requires use of a 
bus. with all of the corresponding additional circuitry and 
chip area. 

[0006] It is thus important to have a coprocessor inter- 
face that is tightly coupled enough that usage of the 



interface is fast enough that invoking even fairly simple 
functions is advantageous, while abstracting the inter- 
face to such an extent that the processor architecture is 
isolated from as many of the details of any given coproc- 
5 essor as possible. Part of this later includes making the 
interface programmer friendly in order to facilitate tailor- 
ing new coprocessor applications in software instead of 
in hardware 

io Brief Description of the Drawings 



[0007] The features and advantages of the present 
invention will be more clearly understood from the. fol- 
lowing detailed description taken in conjunction with the 
accompanying FIGURES where like numerals refer to 
like and corresponding parts and in which: 
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FIG. 1 is a block diagram illustrates one embodi- 
ment of a data processing system, in accordance 
with the present invention; 

FIG. 2 is a block diagram that illustrates a portion of 
processor of FIG. 1 ; 

FIG. 3 is a block diagram that illustrates one 
embodiment of a portion of coprocessor of FIG. 1 ; 
FIG. 4 is-a timing diagram that illustrates a register 
snooping operation, in accordance with the present 
invention; 

FIG. 5 is a timing diagram that illustrates the basic 
instruction interface operation for instruction hand- 
shaking; - 

FIG. 6 is a ; timing diagram that illustrates the 
Instruction interface operation when the H^BUSY* 
signal is used to control coprocessor interface 
instruction execution; 

FIG. 7 is a timing diagram that illustrates instruction 
discard; 

FIG. 8 is a timing diagram that illustrates an exam- 
pie of instruction pipeline stall; 
FIG. 9;is a timing diagram that illustrates an exam- 
ple of back-to-back execution with no stalls; 
FIG. . 10 is a. timing diagram that illustrates back-to- 
back operation with internal pipeline stalls; 
FIG. 1 1 is a timing diagram that illustrates back-to- 
back coprocessor interface, 30 instructions with 
H_BUSY* stalls; 

FIG. 12 is a timing diagram that illustrates an exam- 
ple of the H_EXCP* signal being asserted by a 
coprocessor in response to the decode and 
attempted execution of a coprocessor interface 
opcode; 

FIG. 13 is a timing diagram that illustrates an exam- 
ple of the H_EXCP* signal being asserted by a 
coprocessor in response to the decode and 
attempted execution of a coprocessor interface 
opcode when the coprocessor interface instruction 
is discarded; 

FIG. 14 is a timing diagram that illustrates an exam- 
ple where H_BUSY* has been asserted to delay the 
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execution of an coprocessor interface opcode; 
FIG. 1 5 is a timing diagram that illustrates an exam- 
ple of register transfers associated with the 
H_CALL primitive. 

FIG. 16 is a timing diagram that illustrates an exam- 
ple of register transfers associated with the H_RET 
primitive; 

FIG. 17 is a timing diagram that illustrates the 
sequencing of an H_LD transfer to the coprocessor 
interface 

FIG. 18 is a timing diagram that illustrates the pro- 
tocol when a memory access results in an access 
exception; 

FIG. 19 is a timing diagram that illustrates an exam- 
ple of a transfer associated with the H_ST primitive; 
FIG. 20 is a timing diagram that illustrates an exam- 
ple of a transfer with delayed store data; 
FIG. 21 is a timing diagram that illustrates the pro- 
tocol signals when the store results in an access 
error; 

FIG. 22 illustrates an instruction format for the 
H_CALL primitive, in accordance with the present 
invention; 

FIG. 23 illustrates an instruction format for* the 
H_RET primitive, in accordance with the present 
invention; 

FIG. 24 illustrates an instruction format for the 
H_EXEC primitive, in accordance with the present 
invention; 

FIG. 25 illustrates an instruction format for the 
H_LD instruction, in accordance with the present 
invention; and 

FIG. 26 illustrates an instruction format for the 
H_ST instruction, in accordance with the present 
invention. 

Detailed Description 

[0008] In the following description, numerous specif ic 
details are set forth such as specific word or byte 
lengths, etc. to provide a thorough understanding of the 
present invention. However, it will be obvious to those 
skilled in the art that the present invention may be prac- 
ticed without such specific details. In other instances, 
circuits have been shown in block diagram form in order 
not to obscure the present invention in unnecessary 
detail. For the most part, details concerning timing con- 
siderations and the like have been omitted inasmuch as 
such details are not necessary to obtain a complete 
understanding of the present invention and are within 
the skills of persons of ordinary skill in the relevant art. 
[0009] The term "bus" will be used to refer to a plurality 
of signals or conductors which may be used to transfer 
one or more various types of information, such as data, 
addresses, control, or status. The terms "assert" and 
"negate" will be used when referring to the rendering of 
a signal, status bit, or similar apparatus into its logically 
true or logically false state, respectively. If the logically 



true state is a logic level one, the logically false state will 
be a logic level zero. And if the logically true state is a 
logic level zero, the logically false state will be a logic 
level ne. 

5 [0010] FIGL 1 is a block diagram that illustrates one 
embodiment of a data processing system 10 includes a 
processor 12, a coprocessor 14. a coprocessor 16. a 
memory 13, other modules 20 and external bus inter- 
face 22 which are all bidirectionally coupled by way of 

w bus 28. Alternate embodiments of the present invention 
may have only one coprocessor 14, two coprocessors 
14 and 16 or even more coprocessors (not shown). 
External bus interface 22 is bidirectionally coupled to 
external bus 26 by way of integrated circuit terminals 35. 

is Memory 24 is bidirectionally coupled to external bus 26. 
Processor 12 may optionally be coupled external to 
data processing system 10 by way of integrated circuit 
terminals 31. Coprocessor 14 may optionally be cou- 
pled external to data processing system 10 by way of 

20 integrated circuit terminals 32. Memory 18 may option- 
ally be coupled external to data processing system 10 
by way of integrated circuit terminals 32. Other modules 
20 may optionally be coupled external to data process- 
ing system 10 by way of integrated circuit terminals 34. 

25 Processor 1 2 is bidireaionally coupled to both coproc- 
essor 14 and coprocessor 16 by way of coprocessor 
interface 30. 

[001 1 ] FIG. 2 is a block diagram that illustrates a por- 
tion of processor 12 of FIG. 1. In one embodiment proc- 

30 essor 12 includes control circuitry 40, instruction 
decode circuitry 42, instruction pipe 44. registers 46. 
arithmetic logic unit (ALU) 48, latching multiplexer 
(MUX) 50, latching multiplexer (MUX) 52. and multi- 
plexer (MUX) 54. In one embodiment of the present 

35 invention, coprocessor interface 30 includes signals 60- 
71. Clock signal 60 is generated by control circuitry 40. 
Coprocessor operation signals 61 are generated by 
control circuitry 40 and are provided to coprocessors 14 
and 16. ■. i - - . 

40 [0012] Supervisor -mode, signal 62 is generated by. 
control circuitry 40 and is provided to coprocessors 14 
and 16. Decode signal 63 is generated by control cir- 
cuitry 40 and is provided to coprocessor 14 and 16. 
Coprocessor busy signal 64 is received by control cir- 

45 cuitry 40 from coprocessor 14 or coprocessor 16. Exe- 
cute signal 65 is generated by control circuitry 40 and is 
provided to coprocessors 14 and 16. Exception signal 
66 is received by control circuitry 40 from coprocessor 
14 or coprocessor 16. Register write (REGWR*) signal 

so 67 is generated by control circuitry 40 and is provided to 
coprocessors 14 and 16. Register signals (REG{4:0}) 

68 are generated by control circuitry 40 and are pro- 
vided to coprocessors 14 and 16. Error signal (H_ERR*) 

69 is generated by control circuitry 40 and is provided to 
55 coprocessors 1 4 and 1 6. Data strobe signal (H_DS*) 70 

is generated by control circuitry 40 and is provided to 
coprocessors 14 and 16. Data acknowledge signal 
(H_DA*) 71 is received by control circuitry 40 from 
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coprocessor 1 4 or coprocessor 1 6. Hardware data ports 
signal (HDP{31 :0} 72 which are also considered part of 
coprocessor interface 30 are bi-direciional between 
coprocessors 14 and 16 and internal circuitry within 
processor 12. .„ 5 

[001 3] In one embodiment of the present invention a 
plurality of signals a-e provided to or from bus 28 in 
rder to load or store data in memory 1 6 and/or memory 
24. In one embodiment these signals include a transfer 
request signal (TREQ*) 73 that is generated by control 10 
circuitry 40 and provided to bus 28. Transfer error 
acknowledge signal (TEA*) 74 is provided to control cir- 
cuitry 40 by way of bus 28. Transfer acknowledge signal 
(TA*) 75 is provided to control circuitry 40 by way of bus 
28. Instructions are provided from bus 28 to instruction is 
pipe 44 by way of conductors 76. Data is provided to 
MUX 54 by way of conductors 76. Drive Data signal 79 
enables tristate buffer 95 to provide data from latching 
MUX 52 by way of conductors 88 and 76. Address 
Select signal 78 enables latching MUX 50 to provide 20 
addresses to bus 28 by way of conductors 77. Another 
input: to MUX 54 is provided by the HDP signal . 
(HDP{31 :0}) 72. Another input to MUX 54 is provided by 
way of the ALU result conductors 86. The output of 
MUX 54, result signals 83. are provided to registers 46 25 
and to the input of tristate buffer 96. Drive HDP signal 
82 enables tristate buffer 96 to drive result signals 83 on 
the HDP signals 72. The output of tristate buffer 96-is 
also coupled to the input of latching MUX 52. Alternate 
embodiments of the present invention may include any 30 
number of registers in registers 46. Result signals 83 
are provided as an input to latching MUX 50. Result sig- - . 
nais 83 are provided to registers 46 by way of:MUX 54 . . 
Result Select signal (RESULT_SEUECT). 81 1 selects 
which input of MUX 54 is to be driven on result conduc- 35 
tors 83. Source select signal (SOURCE^SELECT) 80 is 
provided to latching MUX 52 to select which signal shall 
b driven to tristate buffer 95 on conductors 88. Control 
circuitry 40 provides control information and > receives 
status information from registers 46 by way of conduc- *o 
tors 91 . Control circuitry 40 provides control' signals and 
receives status signals from arithmetic logic unit 46 by 
way of conductors 92. Control circuitry 40 provides con- 
trol signals and receives status signals from instruction 
pipe 44 and instruction decode circuitry 42 by way of 45 
conductors 93. Instruction pipe 44 is coupled to provide 
instructions to instruction decode circuitry 42 by way of 
conductors 89. Instruction decode circuitry 42 provides 
decoded instruction information to control circuitry 40 by - 
way of conductors 90. Registers 46 provide source so 
operands to arithmetic logic unit 46 by way of conduc- . 
tors 84. Registers 46 provide data to be stored in mem- 
ory 18 or memory 24 by way of conductors 84. latching 
MUX 52, tristate buffer 95 and conductor 76. Register 
46 provide address information to memory 18 or mem- ss 
ory 24 by way of conductors 84, latching MUX 50 and 
address conductor 77. Registers 46 provide a second 
source operand to arithmetic logic unit 48 by way of con- 



ductors 85. 

[0014] FIG. 3 is a block diagram that illustrates one 
embodiment of a portion of coprocessor 14. In one 
embodiment, coprocessor 14 includes control circuitry 
100, computation circuitry 102 and optional storage cir- 
cuitry 104. Control circuitry 100 is bidirectionally cou- 
pled to processor 1 2 by way of coprocessor interface 30 
which includes signals 60-72. In one embodiment of the 
present invention control circuitry 100 includes decode 
circuitry 106 which receives the operation signals 61 
and the decode signal 63 from processor 12. Control 
circuitry 100 provides control information and receives 
status information from optional storage circuitry 104 by 
way of conductors 108. Control circuitry 100 provides 
control information and receives status information from 
computation circuitry 102 by way of conductors 109. 
Computation circuitry 102 and optional storage circuitry 
104 are bidirectionally coupled by way of conductors 
110. One or more of signals 110 may be provided to or 
from bus 28 or integrated circuit terminals 32. Control 
circuitry 100 may receive or provide information to or 
from bus 28 or integrated circuit terminals 32 by way of 
conductors 112. Signals 72 may be bidirectionally cou- 
pled to computation circuitry 102 and optional storage 
circuitry 104. In addition, signals 72 may be bidirection- 
ally coupled to bus 28 or integrated circuit terminals 32. 
In an alternate embodiment of the present invention, 
optional storage circuitry 104 may not be implemented. 
In embodiments of the present invention in which 
optional storage circuitry 104 is implemented, it may be , 
implemented using registers, any type of memory, any. 
type of storage circuit including latches or programma- 
ble logic arrays, etc. In alternate embodiments of the 
present invention, computation circuitry 102 may per- 
form any type of logic or computational function. 
[001 5] The system provides support for task acceler- 
ation by an external coprocessor 1 4 (or hardware accel- 
erator) which is optimized for.specrf ic application related 
operations; These external coprocessors 14, 16 may be 
as simple as s coprocessor 14 for performing a popula- 
tion count, or a more complicated function such as a 
DSP acceleration coprocessor 14 or coprocessor 14 
capable of high speed multiply/accumulate operation. 
[0016] Data is transferred between the processor 12 
and a coprocessor 14 by one or more of several mech- 
anisms as appropriate for a particular implementation. 
These can be divided into transfers to the coprocessor 
14, and transfers from the coprocessor 14. 
[001 7] One of the mechanisms for transferring data to 
a coprocessor 14 is the Register Snooping mechanism, 
which involves no instruction primitive, but is a by-prod- 
uct of normal processor 12 operation. This involves 
reflecting updates to the processor's 12 general pur- 
pose registers ("GPR") 46 across the interface such that 
a coprocessor 1 4 could monitor updates to one or more 
processor 12 registers. This might be appropriate if a 
coprocessor 14 "overlays" a GPR 46 for an internal reg- 
ister or function. In this case, no explicit passing of 
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parameters from the processor 12 to a coprocessor 14 
would be required. 

[0018] Instruction primitives are provided in the base 
processor 12 for explicit transfer of operands and 
instructions between external coprocessors 14, 16 and s 
the processor 12 as well. A handshaking mechanism is 
provided to allow control ever the rate of instruction and 
data transfer. 

[0019] Note that coprocessor 14 functions are 
designed to be implementation specific units, thus the ro 
exact functionality of a given unit is free to be changed 
across different implementations, e/en though the 
same instruction mappings may be present. 
[0020] FIG. 4 is a timing diagram that illustrates a reg- 
ister snooping operation. To avoid the performance is 
overhead of parameter passing to a coprocessor 14 or . 
external monitor, a register snooping mechanism is pro- 
vided. This allows a coprocessor 14 to implement a 
shadow copy of one or more of the processor's 1 2 gen- 
eral registers 46. The capability is implemented by 20 
transferring the value being written into one of the proc- 
essor GPRs 46 and an indication of which register 46 is 
being updated for each GPR update. A strobe signal 
REGWR* 67 is asserted for each register update. The 
value is transferred across the 32-bit bi-directional data 25 
path HDP[31 :0] 72, and a 5-bit register number bus pro- 
vides a pointer to the actual processor, register 46 being 
updated (REG[4:0]) 68. The register number may refer 
to a register 46 in a normal ffle or in an alternate file. In 
the preferred embodirr»€fni, alternate f3e registers are . 30 
indicated by REG[4] == 1, and normal file registers by 
REG[4] == 0. However, note this invention does not 
depend in any way on the actual partitioning of the reg- . 
ssterset. 

[0021 ] A coprocessor 1 4 may latch the value internally 35 
along with an indication. of the destination register -46 
number to avoid an explicit move later. This functionality 
may also be used by a debug coprocessor. 14 to track 
the state of the register file 46 or a subset of it. F»G 4. 
shows an example of the snooping capability. • - * *o 
[0022] A dedicated 12-bit instruction bus 

(H_OP[1 1 :0]) 61 provides the coprocessor interface 30 
opcode being issued to the external coprocessor 14. 
This bus reflects the low order 1 2 bits of the processor's . 
opcode. The high-order four bits are not reflected as 45 
they are always 0b0100. A supervisor mode indicator 
(H_SUP) 62 is also provided to indicate the current 
state of the PSR(S) bit. indicating whether the proces- 
sor is operating in supervisor or user mode. This can be 
useful for limiting certain coprocessor functions to so 
supervisory mode. A set of handshake signals between 
the processor 12 and external coprocessors 14, 16 
coordinate coprocessor interface 30 instruction execu- 
tion. 

[0023] The control signals generated by the processor 55 
12 are a reflection of the internal pipeline structure of 
the processor 1 2. The processor pipeline 44 consists of 
stages for instruction fetch, instruction decode 42, exe- 



cution, and result writeback. It contains one or more 
instruction registers (IR). The processor 12 also con- 
tains an instruction prefetch buffer to allow buffering of 
an instruction prior to the decode stage 42. Instructions 
proceed from this buffer to the instruction decode stage 
42 by entering tfie instruction.deccde register IR. 
[0024] The instruction decoder 42 receives inputs 
from the IR. and generates outputs based on the value 
held in the IR. These decode 42 outputs are not always 
valid, and may be discarded due to" exception conditions 
or changes in instruction flow. Even when valid, instruc- 
tions may be held in the IR until they can proceed to the 
execute stage of the instruction pipeline. Since this can- 
not occur . until previous instructions: have completed 
execution (which may take multiple docks), the decoder 
will continue to decode the value contained in the IR 
until the I R is updated. 

[0025] . FIG. 5 is a -timing diagram that illustrates the 
basic instruction interface operation for - instruction 
handshaking. An instruction decode strobe* (H_DEC*) 
signal 63 is provided to indicate the decode of an 
coprocessor interface 30 opcode by the processor 12. 
This signal will be asserted when a coprocessor inter- 
face 30 opcode resides in the IR. even if the instruction 
may be discarded without execution. The H_DEC* 63 
output may remain asserted for multiple clocks for the 
same instruction until the instruction is actually issued 
or is discarded. 

[0026] A busy signal,(H_BUSY*) 64 is monitored by 
the processor 12,to determine if an external coproces- 
sor 14 can accept the coprocessor interface 30 instruc- 
tion,, and partially controls when issuance of the 
instruction occurs. If the H.BUSY* 64 signal is negated 
while HJDEC* 63 is asserted, instruction execution will 
not be stalled by the interface, and the H_EXEC* 65 sig- 
nal may assert as soon as . instruction execution can 
proceed. If the H_BUSY* 64 signal is asserted when the 
processor 12 decodes an coprocessor interface 30 
opcode (indicated by the, assertion of H_D EC* 63), exe- 
cution of the coprocessor interface 30 opcode will .be 
forced to stall. Once the H_BUSY; 64 signal is negated, 
the processor 1 2 may issue the instruction by asserting 
H_EXEC* 65. If a coprocessor 1 4 is capable of buffering 
instructions, the H_BUSY* 64 signal may be used to 
assist f illing.of the buffer. 

[0027] FIGL 6 is a timing diagram that illustrates the 
Instruction interface operation when H_BUSY* 64 is 
used to control coprocessor interface 30 instruction exe- 
cution. Once any internal stall condition has been 
resolved, and the H_BUSY* 64 signal has been 
negated, the processor can assert H_EXEC* 65 to indi- 
cate that the coprocessor interface 30 instruction has 
entered the execute stage of the pipeline. An external 
coprocessor 14 should monitor the H_EXEC* 65 signal 
to control actual execution of the instruction, since it is 
possible for the processor to discard the instruction prior 
to execution in certain circumstances. If execution of an 
earlier instruction results in an exception being taken, 
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the H_EXEC* 65 signal will not be asserted, and the 
HJDEC* 63 output will be negated. A similar process 
can occur if the instruction in the IR is discarded as the 
resiiit of a change in program flow. 
[0028] FIG. 7 is a timing diagram that illustrates 
instruction discard. If an instruction is discarded, the 
H_DEC* 63 signal win be negated before another 
coprocessor interface 30 opcode is placed on the 
H_OP[11:0] 61 bus. 

[0029] FIG. 8 is a timing diagram that illustrates an 
example of instruction pipeline stall. There are circum- 
stances where the processor 12 may delay the asser- 
tion of H_EXEC* 65 even though HJDEC* 63 is 
asserted and H_BUSY* 64 is negated. This can occur 
while waiting for an earlier instruction to complete. 
[0030] FIG. 9 is a timing diagram that illustrates an 
example of back-to-back execution with r.o stalls. For 
back-to-back coprocessor interface 30 instructions, the 
H_DEC* 63 signal can remain asserted without nega- 
tion, even though the H_OP[1 i :0] 61 bus is updated as 
new instructions enter the IR. In general, the assertion 
of H__EXEC V 65 corresponds to execution of the instruc- 
tion being decoded on the preyious clock. 
[0031] FIG. 10 is a timing diagram that illustrates 
back-to- back operation with internal pipeline stalls. In 
this case. H_BUSY* 64 is negated, but the processor 
does not assert H_EXEC* 65 for the second coproces- 
sor interface 30 instruction until the internal stall condi- 
tion disappears. 

[0032] FIG. 11 is a timing diagram that illustrates 
back-to-back coprocessor interface 30 instructions with 
H_BUSY* 64 stalls. In thrs example, the external 
coprocessor 14 is busy, and cannot accept the second 
instruction immediately. H_BUSY* 64 asserts to prevent 
the second instruction from being issued by the proces- 
sor 12. Once the coprocessor 14 becomes free, 
H_BUSY* 64 is negated, and the next coprocessor 
interface 30 instruction advances to the execute stage.. 
[0033] Exceptions related to the decode of ah coproc-: 
essor interface 30 opcode may be signaled by an exter- 
nal coprocessor 14 with the H_EXCP* 66 signal. This 
input to the processor 12 is sampled during the clock 
cycle that H_DEC* 63 is asserted and H_BUSY* 64 is 
negated, and will result in exception processing for a 
Hardware Coprocessor 14 Exception rf the coprocessor 
interface 30 opcode is not discarded as previously 
described. Details of this exception processing are 
described below. 

[0034] FIG. 12 is a timing cfiagram that illustrates an 
example of the H_EXCP* 66 signal being asserted by a 
coprocessor 14 in response to the decode and 
attempted execution of a coprocessor interface 30 
opcode. The H_EXCP* 66 signal is sampled by the 
processor 12 during the dock that H_DEC* 63 is 
asserted and H_BUSY* 64 is negated. The H_EXEC* 
65 signal is asserted regardless of whether an excep- 
tion is signaled by the interface; this assertion distin- 
guishes the exception taken case from the instruction 



discard case. 

[0035] Note that the exception corresponds to the 
instruction being decoded the previous dock cyde, and 
that no actual execution should take place. A coproces- 

5 sor 14 must accept the offending instruction and signal 
an exception prior to the execute stage of the processor 
ppefine for it to be recognized. The H_EXCP* 66 signal 
is ignored for a!) dock cycles where HJDEC* 63 is 
negated or H_BUSY* 64 is asserted. 

w [0036] FIG. 13 is a timing diagram that illustrates an 
example of the H_EXCP* 66 signal being asserted by a 
coprocessor 14 in response to the decode and 
attempted execution of a coprocessor interface 30 
opcode. Contrasting this with the timing diagram in FIG. 

is 14, in this example, the coprocessor interface 30 
instruction is discarded, so the H_EXEC* 65 signal is 
not asserted, and the H_DEC* 63 is negated . 
[0037] FIG. 14 is a timing diagram that illustrates an 
example where H_BUSY* 64 has been asserted to 

20 delay the execution of an coprocessor interface 30 
opcode which wiQ result in an exception. 
[0038] The H_BUSY* 64 and H_EXCP* 66 signals are 
shared by all coprocessors 14, 16, thus they must be 
drivenin a coordinated manner. These signals should 

25 be driven (either high or low. whichever is appropriate) 
by the coprocessor 14. 16 corresponding to 
H_OP[11 :10] 61 on dock cycles where H_J)EC* 63 is 
asserted. By driving the output only during the low por- 
tion of the clock, these signals may be shared by multi- 

30 pie coprocessors -14, 16 without contention. A holding 
latch internal to the processor 12 is provided on this 
input to hold it in a valid state for the high phase of the 
dock while no unit is driving it 

[0039] j Some of the coprocessor interface 30 instruc- 

35 tion primitives also imply a transfer of data items 
between the processor 1 2 and an external coprocessor 
14. Operands may be transferred across the coproces- 
sor interface 30 as a function of the particular primitive 
being executed. Provisions are made for transferring 

40 one or more of the processor 12 GPRs either to or from 
coprocessor 14 across a 32-bit bidirectional data path. 
In addition, provisions are also made to load or store a 
single data item from/to memory -18 with the data 
sink/source being the coprocessor interface 30. The 

45 processor 1 2 will pass parameters to external coproces- 
sors 14, 16 via the HDP[31:0] 72 . bus during the high 
portion of CLK 60, operands are received and latched 
from the coprocessor interface 30 by the processor 12 
during the low phase of the dock. A delay is provided as 

so the dock transitions high before drive occurs to allow for 
a small period of bus hand-off. A coprocessor 14 inter- 
face must provide the same small delay at the falling 
dock edge. Handshaking of data items, is supported 
with the Data Strobe (HJDS* 70) output, the Data 

55 Acknowledge (H_DA* 71) input, and the Data Error 
(H_ERR* 69) output signals. 

[0040] The processor 12 provides the capability of 
transferring a . list of call or return parameters to the 



6 



11 



EP 0 901 071 A2 



12 



coprocessor interface 30 in much the same way as soft- 
ware subroutines are called or returned from. A count of 
arguments is indicated in the H_CALL or H_RET primi- 
tive to control the number of parameters passed. Regis- 
ter values beginning with the content of processor 12 5 
register R4 are transferred to (from) the external coproc- 
ssor 14 as part of the execution of the H_CALL 
(H_RET) primitive- Up to seven register parameters 
may be passed. This convention is similar to the soft- 
ware subroutine calling convention. 10 
[0041 ] Handshaking of the operand transfers are con- 
trolled by the Data Strobe (H_DS* 70) output and Data 
Acknowledge (H_DA* 71) input signals. Data Strobe will 
be asserted by the processor 12 for the duration of the 
transfers, and transfers will occur in an overlapped man- 15 
ner. much the same as the processor 12 interface oper- 
ation. Data Acknowledge (H_DA*) 71 is used to indicate 
that a data element has been accepted or driven by a 
coprocessor 14. 

[0042] FIG. 1 5 is a timing diagram that illustrates an 20 
example of register 46 transfeis associated with the 
H.CALL primitive. Instruction primitives are provided to 
transfer multiple processor registers and the transfers 
can ideally occur every clock. For transfers to an exter- 
nal coprocessor 14. the processor will automatically 25 
begin driving the next operand (if needed) prior to (or 
concurrent with) the acknowledge of the current item. 
External logic must be capable of one level of buffering 
to ensure no loss of data. This FIG. shows the sequenc- 
ing of an H_CALL transfer to the coprocessor interface 30 
30. where two registers are to be transferred. The sec- 
ond transfer is repeated due to a • negated 1 Date 
Acknowledge (H_DA*) 71. ' 
[0043] For transfers from an external coprocessor 14 
to processor registers 46. the processor 12 is capable of 35 
accepting values from an external coprocessor 1 4 every 
clock cycle after H_DS^ 70 has been asserted, and 
these values are written into the register file 46 as they 
are received, so no buffering is required. 
[0044] FIG. 16 is a timing diagram that- illustrates an .40 
example of register 46 transfers associated with the 
H_RET primitive. In this example; two register 46 values 
are transferred. The coprocessor 14 may drive data 
beginning with the clock following the assertion of the 
H_EXEC* 65 signal, as this is the clock where H_DS* 45 
70 will first be asserted. The H_DS* 70 output transi- 
tions with the rising edge of CLK 60. while the H_DA* 71 
input is sampled during the low phase of CLK 60. 
[0045] The processor 12 provides the capability of 
transferring a single memory operand to or from the so 
coprocessor interface 30 with the H_LD or H_ST 
instruction primitives. 

[0046] The H _LD primitive is used to transfer data 
from memory 18 to a coprocessor 14. Handshaking of 
the operand transfer to the coprocessor 1 4 is controlled 55 
by the Data Strobe (H_DS*) 70 signal. Data Strobe will 
be asserted by the processor 12 to indicate that a valid 
operand has been placed on the HDP[31 :0] 72 bus. The 



Data Acknowledge (H_DA*) 71 input is ignored for this 
transfer. 

[0047] FIG. 1 7 is a timing diagram that illustrates the 
sequencing of an H_LD transfer to the coprocessor 
interface 30. In this case, there is a no-wait state mem- 
ory 18 access. For memory 18 accesses with n wait- 
states, the operand and H_DS* 70 would be driven n 
docks later. If the option to update the base register 46 
with the effective address of the load is selected, the 
update value is driven on HDP[31 ft] 72 the first clock 
after it has been calculated (the clock following the 
assertion of H_EXEC* 65). 

[0048] FIG. 1 8 is a timing diagram that illustrates the 
protocol when a memory 1 8 access results in an access 
exception. In such a case, the H_ERR* 69 signal is 
asserted back to the external coprocessor 14. 
[0049] The H_ST primitive can be used to transfer 
data to memory 18 from a coprocessor 14. ff the option 
to update the base register 46 with the effective address 
of the store is selected, the update value is driven on 
HDP[31 :0] 72 the first clock after it has been calculated 
(the clock following the assertion of .H_EXEC: 65). 
[0050] FIG. 19 is a timing diagram that illustrates an 
example of a transfer associated with the H_ST primi- 
tive- The handshake associated with the H_ST primitive 
consists of two parts, an initial handshake from the 
coprocessor 14. which must provide data for the store, 
and a completion handshake from the processor 12 
once the store to memory 1 8 has completed. 
[0051] The initial handshake uses the H_DA* 71 input 
to the processor 12 to signal that the coprocessor 14 
has driven store data to the processor 12. The H_DA* 
71 signal is asserted the same clock that data is driven 
onto the HDP[31 :0] 72 bus by the coprocessor 14. The 
store data is taken from the lower half of the bus for a 
halfword sized.store. the upper 16 bits will not be written 
into memory 18. The H_DA* 71 signal will be sampled 
beginning with the.clqck^the H_EXEC* 65 signal is 
asserted. The. memory cycle is requested during the 
dock where H^DA* 71 is recognized. and store data will 
be driven to memory 1 8 on the following dock. Once the 
store has completed, the processor 12 will assert the 
H_DS* 70 signal. 

[0052] FIG. 20 is a timing diagram that illustrates an 
example of a transfer with delayed store data. 
[0053] FIG. 21 is a timing diagram that illustrates the 
protocol signals when the store results in an access 
error. Note here that the H_ERR* 69 signal is asserted. 
If the hardware unit aborts the instruction by asserting 
H_EXCP* 66 the dock where H_EXEC* 65 is asserted, 
the H_DA* 71 signal should not be asserted. 
[0054] FIGs. 22 through 26 illustrate instructions pro- 
vided as part of the instruction set to interface to aHard- 
ware Accelerator (or coprocessor) 14. The processor 12 
interprets some of the fields in the primitives, others are 
interpreted by the coprocessor 14 alone. 
[0055] FIG. 22 illustrates an instruction format for the 
H_CALL primitive. This instruction is used to "call" a 
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function implemented by a coprocessor 14. The para- 
digm is similar to a standard software calling conven- 
tion, but in a hardware context. The H_CALL primitive is 
interpreted by both the processor 12 and the coproces- 
sor 1 4 to transfer a list of "call parameters" or arguments 
from the processor 12 and initiate a particular function 
in the coprocessor 14. 

[0056] The UU and CODE fields of the instruction 
word are not interpreted by the processor 12 , these are 
used to specify a coprocessor 14 specific function. The 
UU field may specify a specific coprocessor 14, 16, and 
the CODE field may specify a particular operation. The 
CNT field is interpreted by both the processor 12 and 
the coprocessor 14, and specifies the number of regis- 
ter arguments to pass to the coprocessor 14. 
[0057] Arguments are passed from the general regis- 
ters 46 beginning with R4 and continuing through 
R(4+CNT - 1). Up to seven parameters or registers 46 
may be passed in a single H_CALL invocation. 
[0058] The H_CAUL instruction can be used to imple- 
ment modular module invocation. Usage of this type of 
interface has long been known to result in software sys- 
tems with higher reliability and fewer bugs. Function 
parameters are usually best passed by value. This sig- 
nificantly reduces side-effects. In many cases; modern 
compilers for block-structured languages such as C arid 
C++ pass short sequences of parameters or arguments 
to invoked functions or subroutines in registers 46. This 
technique can be implemented with the-H_CALL 
instruction. A compiler can be configured to load up to 
seven parameters or arguments into successive regis- 
ters 46 starting at R4, then generating the K_CALL 
instruction, which replaces the standard compiler gen- 
erated subroutine linkage instruction. ; 
[0059] FIG. 23 illustrates an instruction format for the 
H_RET primitive. This instruction is used to ^return 
from" a function implemented by a coprocessor 14. The 
paradigm is similar to the software calling convention 
used by the processor 12, but in a hardware context 
The H_RET primitive is interpreted by berth the proces- 
sor 1 2 arid the coprocessor 1 4 to transfer a list of ' return 
parameters" or values to the processor 12 from a 
coprocessor 14. 

[0060] The UU and CODE fields of the instruction 
word are not interpreted by the processor 12 ,these are 
used to specify a coprocessor 14 specific function. The 
UU field may specify a hardware unit, and the CODE 
f ield may specify a particular operation or set of regis- 
ters 46 in the coprocessor 14 to return. The CNT field is 
interpreted by both the processor 12 and the coproces- 
sor 14, and specifies the number of register 46 argu- 
ments to pass from the coprocessor 1 4 to the processor 
12. 

[0061] Arguments are passed to the processor 12 
general registers 46 beginning with R4 and continuing 
through R(4+CNT-1). Up to seven parameters (or regis- 
ter contents) may be returned. 

[0062] As with the H_CALL instruction, the H_RET 



instruction can also be used to implement modular pro- 
gramming. Structured programming requires that func- 
tion return values are best passed back to a calling 
routine by value. This is often done efficiently by compfl- 

s ers by placing one or more return values in registers for 
a subroutine or function return. It should be* noted 
though that traditional structured programming expects 
a subroutine or function to return immediately after the 
subroutine or function invocation. In the case of coproc- 

w essors 1 4, execution is often asynchronous with that of 
the invoking processor 12. The H_RET instruction can 
be used to resynchronize the processor 12 and coproc- 
essor 14. Thus, the processor 12 may load one or more 
registers 46, activate the coprocessor 14 with one or 

is more H_CALL instructions, execute unrelated instruc- 
tions, and then resynchronize with the coprocessor 14 
while receiving a resulting value or values from the 
coprocessor 14 by issuing the H_RET;instruction. 
[0063] FIG. 24 illustrates an instruction format for the 

20 H_EXEC primitive. This instruction is used to initiate a 
function or enter an operating mode implemented by an 
Accelerator. -The HJEXEC instruction can be used to 
control a function in a specific coprocessor 14, 16 spec- 
ified by a UU field. The code field is not interpreted by 

25 the processor 12 but is rather reserved for the desig- 
nated coprocessor 14, 16. The UU and CODE fields of 
the instruction word are not interpreted by the processor 
1 2, these are used to specify a coprocessor 1 4 specific 
function. The UU field may specify a specific coproces- 

30 sor 1 4, 1 6, arid the CODE field may specify a particular 
operation. ; - • t • .. 

[0064] FIG. 25 illustrates an instruction format for the 
H_LD instruction. This instruction' is used to pass a 
value from memory 1 8 to a coprocessor 1 4 without tem- 

35 porarily storing the memory operand in a General Pur- 
pose Register (GPR) 46. The memory operand is 
addressed using a base pointer and an offset. 
[0065] The H_LD instruction performs a load of a 
value in memory 18, and passes the memory operand 

40 to the coprocessor 14 without storing it in a register 46. 
The H_LD operation has three options, w- word, h- half 
word and u- update. Disp is obtained by scaling the 
IMM2 field by the size of the load, and zero-extending. 
This value is added to the value of Register RX and a 

45 load of the specified size is performed from this 
address, with the result of the load passed to the hard- 
ware interface 28. For halfword loads, the data fetched 
is zero-extended to 32 -bits. If the .u option is specified, 
the effective address of the load is placed in register RX 

so 46 after it is calculated. 

[0066] The UU field of the instruction word is not inter- 
preted by the processor 12, this field may specify a spe- 
cific coprocessor 14, 16. The Sz field specifies the size 
of the operand (halfword or word only). The Disp field 

ss specifies an unsigned offset value to be added to the 
content of the register specified by the Rbase field to 
form the effective address for the load. The value of the 
Disp field is scaled by the size of the operand to be 
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transferred. The Up field specifies whether the Rbase 
register 46 should be updated with the effective address 
of the load after it has been • calculated. This option 
allows an "auto-update" addressing mode. 
[0067] FIG. 26 illustrates an instruction format for the s 
H_ST instruction. This instruction is used to pass a 
value from a coprocessor 14 to memory 18 without tem- 
porarily storing the memory operand in a processor 12 
register 46. The memory operand is addressed using a 
base pointer and an offset. 10 
[0068] The UU field of the instruction word is not inter- 
preted by the processor 12. Rather this field may specify 
a specific coprocessor 1 4, 1 6. The Sz field specifies the 
size of the operand (harfword or word only). The Disp 
field specifies an unsigned offset value to be added to 75 
the content of the register 46 specified by the. Rbase 
field to form the effective address for the store. The 
value of the Disp field is scaled by the size of the oper- 
and to be transferred. The Up field specifies whether the 
Rbase register should be updated with the effective 20 
address of the store after it has been calculated. This 
option allows an "auto-update" addressing mode. 
[0069] The H_ST instruction performs a store to mem- 
ory 18. of an operand from a coprocessor 14 without 
storing it in a register 46. The H_ST operation has three : 25 
options, w- word, h- half word and u^ update; Disp is 
obtained by scaling the IMM2 field by the size of the 
store and zero-extending. This value is added to the. . 
value of Register RX and a store of the specified size is 
performed to this address, with the .data for the. store 30 
obtained from the hardware interface. If the .u option is 
specified, the effective address of the load; is placed in 
register RX after it is calculated. 
[0070] The H_LD instruction and the H_ST instruction 
provide an efficient mechanism to move operands from 35 
memory 18 to a coprocessor 14 and from a coprocessor 
14 to memory 18 without the data being moved routing 
through registers 46. The offset and indexing provisions » 
provide a mechanism for eff iciently . stepping through : 
arrays. Thus, these instructions are especially useful 40 
within loops. Note should be made that both instructions 
synchronize the processor 1 2 with the coprocessor 1 4 
for every operand loaded or stored. If this is not neces- 
sary or even preferred, one may alternatively stream 
data to the coprocessor 1 4 by repeatedly loading a des- 45 
ignated register or registers 46 with data from memory 
18, and have the coprocessor 14 detect these loads 
since the coprocessor interface bus 30 is also used for 
register snooping. 

[0071 ] Those skilled in the art will recognize that mod- so 
rf ications and variations can be made without departing 
from the spirit of the invention. Therefore, it is intended 
that this invention encompass all such variations and 
modifications as fall within the scope of the appended 
claims. 55 



Claims 

1. In a. processor (12) adapted to cooperate with a 
coprocessor (14.16) coupled thereto via a commu- 
nication bus (30). in an execution of at least one 
instruction (H_CALL) comprising a count field and a 
code field, a method for executing said instruction 
comprising the steps of : 

receiving said instruction; 
providing to said coprocessor, via a first cycle 
on said communication bus,, said count and 
code fields; 

if the count field has a value, n, greater than 
zero, providing to the coprocessor, via a sec- 
ond cycle on said communication bus, a first 
operand; and : 
. completing said instruction. 

2. The method of claim 1 further comprising the steps 
of: 

receiving a first input signal (64) from the 
coprocessor via said communication bus dur- 
- ing said-first cycle; 

, wherein, if the first input signal received from 
the coprocessor, via the communication bus 
during said first cycle has a first state, then 
repeating said first cycle, wherein the step of 
repeating said first cycle includes: 
providing to said coprocessor, via said first 
cycle on said communication bus, said count 
arid code fields; and 

receiving said first input signal from the coproc- 
< essor via said communication bus during said 
first cycle; 

and wherein, if the first input signal received 
fronv the* coprocessor via the communication 
/bus during said £ first cycle has a second state, 
. then providing <to. said coprocessor, via said 
second cycle on said communication bus, a 
first output signal (65). 

3. The method of claim 1 further comprising the steps 
of: 

if a count field value, n, is greater than one, 
then, on each of (n-1) cycles on said communi- 
cation bus: 

providing to the coprocessor, via said com- 
munication bus. a next one of (n-1) oper- 
ands; and 

receiving from the coprocessor, via said 
communication bus. a second input signal. 

4. The method of claim 1 wherein the processor is 
adapted to cooperate with a plurality of coproces- 
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sors coupled thereto via said communication bus in 
the execution of said instruction, the method further 
comprising the step of: 

providing to said plurality of coprocessors, via s 
said communication bus during said first cycle, 
an identif ier field having a value which uniquely 
identifies a selected one of said plurality of 
coprocessors. 

w 

5. The method of claim 1 wherein the processor 
includes a plurality of registers for storing selected 
operands, and wherein the step of providing said 
first operand to said coprocessor via said communi- 
cation bus during said second cycle is further char- is 
acterized as: 

if the count field has a value, n, greater than 
zero, providing to the coprocessor, via a sec- 
ond cycle on said communication bus, an oper- 20 
and stored in a predetermined one of said 
plurality of registers. 

6. In a processor (12) adapted to cooperate with a 
coprocessor (14, 16) coupled thereto via a commu- 25 
nication bus (30) in an execution of at least one 
instruction (H_LD) comprising an effective address 
calculation field, a method for executing said 
instruction comprising the steps of : 

30 

receiving said instruction; 
providing to said coprocessor, via a first cycle 
on said communication bus, said effective ! 
address calculation field; 

calculating an effective address in accordance 35 
with said effective address calculation field; 
fetching an operand stored at said calculated 
effective address; 

providing to the coprocessor, via a second 
cycle on said communication bus, said fetched 4c] 
operand; and 

completing said instruction. 



7. The method of claim 6 further comprising the steps 
of: 



45 



receiving a first input signal (64) from the 
coprocessor via said communication bus dur- 
ing said first cycle; 

wherein, if the first input signal received from so 
the coprocessor via the communication bus 
during said first cycle has a first state, then 
repeating said first cycle, wherein the step of 
repeating said first cycle includes: 
providing to said coprocessor, via said first ss 
cycle on said communication bus, said effec- 
tive address calculation field; and 
receiving said first input signal from the coproc- 



essor via said communication bus during said 
first cycle; 

and wherein, if the first irput signal received 
from the coprocessor via the communication 
bus during said first cycle has a second state, 
then providing to said coprocessor, via said 
second cycle on said communication bus, a 
first output signal (65). 

8. The method of claim 6 wherein the processor 
includes a plurality of registers for storing selected 
operands, wherein the effective address calculation 
field comprises a base register designator subfield 
and a displacement subfield, and wherein the step 
of calculating said effective address in accordance 
with said effective address calculation field includes 
adding the contents of said displacement subfield 
to the contents of the one of said plurality of regis- 
ters designated by the contents of said base regis- 
ter designator subfield. 

9. In a processor (12) adapted to cooperate with a 
coprocessor (14, 16) coupled thereto via a commu- 
nication bits (30) in an execution of at least one 
instruction (H_ST) comprising an effective address 
calculation field, a method for executing said 
instruction comprising the steps of : 

receiving said instruction; 
providing to said coprocessor, via a first cycle 
on said communication bus, said effective 
.address calculation field; 
calculating an ieffective address in accordance 
with said effective address calculation field; 
receiving from the coprocessor, via a second 
cycle on said communication bus, an operand; 
storing said received operand at said calcu- 
lated effective address; and 
completing said instruction. 

1.0. The method of claim 9 wherein the processor 
includes a plurality of registers for storing selected 
operands, wherein the effective address calculation 
field comprises a base register designator subfield 
(Rx) and a displacement subfield (IMM2) and 
wherein the step of calculating said effective 
address in accordance with said effective address 
calculation field includes adding the contents of 
said displacement subfield to the contents of one of 
said plurality of registers designated by the con- 
tents of said base register designator subfield. 
wherein the effective address calculation field 
includes an update field (UP), and wherein the step 
of calculating said effective address in accordance 
with said effective address calculation field includes 
storing said calculated effective address in said 
designated one of said plurality of registers if said 
update field has a first value. 
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